arXiv: 1507.07114vl [stat.ME] 25 Jul 2015 


Improved initialisation of model-based clustering using 
Gaussian hierarchical partitions 


Luca Scrucca Adrian E. Raftery 

Universita degli Studi di Perugia University of Washington * 


July 28, 2015 


Abstract 

Initialisation of the EM algorithm in model-based clustering is often crucial. Various 
starting points in the parameter space often lead to different local maxima of the likelihood 
function and, so to different clustering partitions. Among the several approaches avail¬ 
able in the literature, model-based agglomerative hierarchical clustering is used to provide 
initial partitions in the popular mclust R package. This choice is computationally con¬ 
venient and often yields good clustering partitions. However, in certain circumstances, 
poor initial partitions may cause the EM algorithm to converge to a local maximum of 
the likelihood function. We propose several simple and fast refinements based on data 
transformations and illustrate them through data examples. 
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1 Introduction 


Model-based clustering is an increasing popular method for unsupervised learning. In con¬ 
trast to classical heuristic methods, such as fc-means and hierarchical clustering, model-based 
clustering methods rely on a probabilistic assumption about the data distribution. According 
to the main underlying assumption, data are generated from a mixture distribution, where 
each cluster is described by one or more mixture components. Maximum likelihood estima¬ 


tion of parameters is usually carried out via the EM algorithm (Dempster et al, 1977). The 
EM algorithm is an iterative, strictly hill-climbing procedure whose performance can be very 
sensitive to the starting point because the likelihood surface tends to have multiple modes, 
although it usually produces sensible results when started from reasonable starting values. 
Thus, good initialisation is crucial for finding MLEs, although no method suggested in the 
literature uniformly outperforms the others. 


In the case of Gaussian model-based clustering (Eraley and Raftery 2002), several ap¬ 
proaches are available, both stochastic and deterministic, for selecting an initial partition of 


the observations, or an initial estimate of the parameters. In the mclust R package (Fra¬ 


ley et al, 2012, 2015), the EM algorithm is initialised using the partitions obtained from 


model-based agglomerative hierarchical clustering. Efficient numerical algorithms exist for 
approximately maximise the classification likelihood with multivariate normal models. How¬ 
ever, in certain circumstances, poor initial partitions may cause the EM algorithm to converge 
to a local maximum of the likelihood function. 

In this contribution we discuss cases where an initial partition may lead to sub-optimal 
maximum likelihood estimates when applied to coarse data with ties (e.g. discrete data or 
continuous data that are rounded in some form when measured), and we present some possible 
refinements to improve the htting of such finite mixture models. 

The outline of this article is as follows. Section gives a brief review of background 
material on model-based clustering, with special attention devoted to some of the proposals 
available in the literature for the initialisation of EM algorithm. Section discusses the 
model-based hierarchical agglomerative clustering method used for starting the EM algorithm. 
This is a very convenient and efficient algorithm, but in certain circumstances presents a 
serious drawback. Section [^contains some simple transformation-based methods to refine the 
EM initialisation step derived from model-based agglomerative hierarchical clustering. The 
behaviour of these methods is illustrated through the use of real data examples in Section 
The hnal section provides some concluding remarks. 
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2 Background 

2.1 Model-based clustering overview 


Let xi,X 2 , ■ ■. fXn be a sample of n independent identically distributed observations. The 
distribution of every observation is specified by a probability mass or density function through 
a finite mixture model of G components, which takes the following form 

G 

= '^TTkfk{x-,ek), ( 1 ) 

k=l 


where = {vri,..., ttg-i, , do} are the parameters of the mixture model, fk{x] Ok) is 

the A:th component density at x with parameter(s) Ok, (vri,..., ttg-i) are the mixing weights 
or probabilities (such that tta, > 0, Ylk=i = 1); and G is the number of mixture components. 

Assuming G fixed, mixture model parameters T' are usually unknown and must be esti¬ 
mated. The log-likelihood function corresponding to equation ([^ is given by xi,..., x^) = 
[xi] ’!'))• Direct maximisation of the log-likelihood function is often complicated, 


so MLE of hnite mixture models is usually carried out via the EM algorithm (McLachlan and 


Peel, 2000). 


In the model-based approach to clustering, each component of a finite mixture of density 
functions belonging to a given parametric class is associated with a group or cluster. Most 
applications assume that all component densities arise from the same parametric distribution 
family, although this need not be the case in general. A popular model assumes a Gaussian 
distribution for each component, i.e. fk{x;0k) ~ ^{^lk■,'^k)■ Thus, clusters are ellipsoidal, 
centred at the mean vector //^,, and with other geometric features, such as volume, shape and 
orientation, determined by Parsimonious parameterisations of the covariance matrices 
can be dehned by means of eigen-decomposition in the form = 'jkOkAkOj, where 'jk is 
a scalar controlling the volume of the ellipsoid, Ak is a diagonal matrix specifying the shape 
of the density contours, and Ok is an orthogonal matrix which determines the orientation of 


the corresponding ellipsoid (Banfield and Raftery, 1993; Celeux and Govaert, 1995). Eraley 


et al (2012, Table 1) summarized some parameterisations of within-group covariance matrices 


available in the mclust software, and the corresponding geometric characteristics. 

The number of mixture components and the parameterisation of the component covariance 
matrices can be selected on the basis of model selection criteria, such as the Bayesian informa¬ 


tion criterion (BIG; Schwartz, 1978; Eraley and Raftery, 1998) or the integrated complete-data 


likelihood criterion (IGL; Biernacki et al, 2000). 
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2.2 Initialisation of EM algorithm 

The EM algorithm is an easy to implement, numerically stable algorithm, which has, under 
fairly general conditions, reliable global convergence. However, it may converge slowly and, 
like any other Newton-type method, does not guarantee convergence to the global maximum 


when there are multiple maxima (McLachlan and Krishnan, 2008, p. 29). Further, in the 
case of finite mixture modelling, the estimates obtained depends on the starting values. Thus 
initialisation of EM is crucial because the likelihood surface tends to have multiple modes, 
although it usually produces sensible results when started from reasonable starting values 


(Wu, 1983; Everitt et al, 2011, p. 150). 

Several approaches are available, both stochastic and deterministic, for initialising the 
EM algorithm. Broadly speaking, there are two general approaches. The first one starts 
from some initial values for the parameters to be estimated. A simple strategy is based 
on generating several candidates by drawing parameter values uniformly at random over 
the feasible parameters regions. Since the random-starts strategy has a fair chance of not 
providing good initial starting values, a common suggestion to alleviate this problem is to 
run the EM algorithm with several random starts and to choose the best solution. However, 
such a strategy can be quite time consuming and is not always practical, especially for high¬ 
dimensional datasets. 

Two other stochastic initialisation schemes are the so-called emEM and rndEM. The 


former approach, proposed by Biernacki et al (2003), uses several short runs of the EM 
initialised with valid random starts as parameter estimates until an overall number of total 
iterations is exhausted. Then, the solution with the highest log-likelihood is chosen to be 
the initialiser for the long EM, which runs until the usual strict convergence criteria are 
met. This approach is computationally intensive and the same comments made above about 
random starts apply to it also. 


Two related approaches were proposed by Maitra (2009), one called Rnd-EM where the 
short EM stage is replaced by choosing multiple starting points and evaluating the log- 
likelihood at these values without running any EM iterations. Then, the best obtained solution 
serves as an initialiser for the long EM stage. The second proposal is a staged approach based 
on finding a large number of local modes of the dataset, and then to choose representatives 
from the most widely-separated ones. This approach is reported to be very time-consuming 


for high-dimensional data, and Melnykov and Maitra (2010) found the method to be outper¬ 


formed by emEM and RndEM. Recently, Melnykov and Melnykov (2012) have proposed a 
strategy for initialising mean vectors by choosing points with higher concentrations of neigh¬ 
bours and using a truncated normal distribution for the preliminary estimation of dispersion 
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matrices. 


A second kind of approach for initialising the EM algorithm is based on the partition 
obtained from another clustering algorithm, e.g. A:-means or hierarchical agglomerative clus¬ 
tering (HAC). In this case, the final classification is used to start the EM algorithm from the 
M-step. Unfortunately, most of these partitioning algorithm have several drawbacks, such 
as the need to be properly initialised or the tendency to impose specific shapes and pat¬ 
terns on clusters. In the popular mclust package for R, the EM algorithm is initialised using 
the partitions obtained from model-based hierarchical agglomerative clustering (MBHAC). 
In this approach, k clusters are obtained from a large number of smaller clusters by recur¬ 
sively merging the two clusters that have the smallest dissimilarity in a model-based sense, 
i.e. the dissimilarity used for agglomeration is derived from a probabilistic model. Banfield 


and Raftery (1993) proposed a dissimilarity based on a Gaussian mixture model, which is 


equal to the decrease in likelihood resulting by the merging of two clusters. Eraley (1998) 
showed how the structure of some specific Gaussian models can be exploited to yield efficient 
algorithms for agglomerative hierarchical clustering. More details about this approach are 
discussed in the following section. 


3 Model-based hierarchical agglomerative clustering 


The objective of MBHAG is to obtain a hierarchical ordering of clusters of n objects on the 
basis of the some measure of similarity among them. The result is a tree-like structure, which 
proceeds from n clusters containing one object to one cluster containing all n objects by 
successively merging objects and clusters. 

Given n objects or observations (aii,..., let (zi,..., Zn)~^ denote the classification 
labels, i.e. Zi = k if Xi is assigned to cluster k. The unknown parameters are obtained by 
maximising the classification likelihood 

n 

(^1 ) • • • ) ^li ■ • • 1 ^n\^li ■ • • 1 ®n) — fZi | ^ Zi ) • 

i=l 


Assuming the multivariate Gaussian distribution for the parameters are the mean 

vector and the covariance matrix By imposing different covariance structures over 


Sfcj different criterion can be derived (see Eraley, 1998, Table 1). For instance, when the 
covariance is allowed to be different among clusters, the criterion to be minimised at each 
hierarchical stage is 


G 

Wg = '^nk log 

fc=i 


nk 


( 2 ) 
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where is the number of observations in group k and Wk is the sample cross-product matrix 
for the kth. group {k = Efficient numerical algorithms are available for approxi¬ 

mately maximising the classification likelihood (or, equivalently, minimising the corresponding 
criterion) with multivariate Gaussian models have been discussed. 

As mentioned, the above MBHAC approach is used for starting the EM algorithm in 
the mclust R package. This is particularly convenient because the underlying probabilistic 
model can be shared by the initialisation step and the model fitting step. MBHAC is also 
computationally convenient because a single run provides the basis for initialising the EM 
algorithm for any number of mixture components and parameterisations of the component 
covariance matrices. 

However, a serious problem for the MBHAC approach may arise when, at any stage, two 
pairs of objects attain the same minimum value for the criterion in Q. In the presence of 
coarse data, resulting from the discrete nature of the data or from continuous data that are 
rounded in some way when measured, ties must be broken by choosing the pair of entities that 
will be merged. This is often done at random but, regardless of which method is adopted for 
breaking ties, this choice can have important consequences because it changes the clustering 
of the remaining observations. In this case the hnal EM solution may depend on the ordering 
of the variables, and to a lesser extent on permutation of the observations (the latter case is 
not studied further in this paper). 

This difficulty is known as the ties in proximity problem in the hierarchical clustering 


literature (see, for example, Jain and Dubes, 1988 Sec. 3.2.6). This problem can also arise in 


other contexts, such as /c-means clustering (Gordon 1999 p. 42) or partition around medoids 


(PAM; Kaufman and Rousseeuw, 1990, p. 104). 


4 Transformation-based approaches for obtaining starting par¬ 
titions in model-based hierarchical agglomerative clustering 

In this section we describe some simple proposals for starting the EM algorithm using the 
partitions obtained with MBHAC. Ideally, we would like to retain the positive aspects of such 
approach, but, at the same time, reduce the chance that a poor initial partition causes the 
EM algorithm to converge to a local maximum of the likelihood function. 

The idea is to project the data through a suitable transformation before applying the 
MBHAC at the initialisation step. Once a reasonable hierarchical partition is obtained, the 
EM algorithm is run using the data on the original scale. 

Let X be the (n x p) data matrix, and X = {X — be the corresponding centred 

matrix, where x = (xi,... ,Xp)'^ is the vector of sample means, and In is the unit vector of 
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length n. Let S = {s^} = X''~X/n be the {p x p) sample covariance matrix. Consider the 
singular value decomposition (SVD) of the centred data matrix, 

r 

X = UDV^ = ^ XiUiv], 

i=\ 

where Ui are the right singular vectors, u, the left singular vectors, Ai > A 2 > ... > A,. > 0 
the singular values, and r the rank of matrix X. Similarly, the centred and scaled data matrix 
can be decomposed as 

r 

= U*D*V*^ = ^\*u*vf, 

i=l 

where S = diag(sf,..., Sp) is the diagonal matrix of sample variances. In both SVD decom¬ 
positions the rank r is the rank of the data matrix, i.e. r < min(n,p), with equality when 
there are no singularities. 

We now provide the details of the transformations investigated and some remarks. 

4.1 Data standardisation 

The first data transformation which has been investigated is the simple standardisation of 
variables (STD): 

Zstd ^ X5-V2, 

for which E(Zstd) = 0 and Var(ZsTD) = Cor(Z’sTD), he. the features are centred at zero, 
with unit variances and correlated. This transformation only shifts the data at the origin 
and scale the variables to a common unit of measure, while preserving the data distribution 
shape. 

4.2 Data sphering 

Sphering (or whitening) the data (SPH) is obtained by applying the following transformation: 

ZsPH ^ XyD-^Vn = U^/n, 

where V and D~^^/n = diag(y^/Ai) are, respectively, the matrix of eigenvectors and the 
diagonal matrix of square root inverse of eigenvalues from the spectral decomposition of the 
sample marginal covariance, S. For this transformation, E(.Zsph) = 0 and Var(.ZspH) = 
I, so the features are centred at zero, with unit variances and uncorrelated. Thus, this 
transformation converts an elliptically shaped symmetric cloud of points into a spherically 
shaped cloud. 
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4.3 PCA scores from covariance matrix 

The principal component transformation from the covariance matrix (PCS) is obtained as: 


Zpcs ^ XP = UD, 


for which E(Zpcs) = 0 and Var(Zpcs) = jn = diag(A?/n), so the features are centred, 
uncorrelated and with decreasing variances equal to the eigenvalues of S. Usually the first 
few components account for most of the dispersion in the data. A similar idea was proposed 


by McLachlan (1988), who discussed the use of principal component analysis in a preliminary 
exploratory analysis for the selection of suitable starting values. 


4.4 PCA scores from correlation matrix 

The principal component transformation from the correlation matrix (PCR) is defined as: 

ZpcR ^ XS~i/V* = U*D\ 

for which E(Zpcr) = 0 and Var(ZpcR) = D*"^ jn = diag(A*^/n), so the features are centred, 
uncorrelated and with decreasing variances equal to the eigenvalues of the marginal sample 
correlation matrix, 


4.5 Scaled SVD projection 

The scaled SVD transformation (SVD) is computed as: 

ZsvD ^ 

for which E(Zsvd) = 0 and Var(ZgvD) = D*/n = diag(A*/n). Again the features are 
centred, uncorrelated and with decreasing variances equal to the square root of the eigenvalues 
of the marginal sample correlation matrix, . In this case the features’ dispersion 

presents a gentle decline compared to the PCR case. 

4.6 Remarks 

All the above transformations, except for STD, allow one to remove the “ordering” effect, so 
that even in the presence of ties the partitions obtained by applying MBHAC are invariant 
to permutations of the input variables. Furthermore, as shown in the next section through 
examples with real data, most of them allow one to achieve better clustering results when 
used for initialising the EM algorithm in Gaussian model-based clustering. 

To get an idea of this, consider Figure where we report a scatterplot matrix with pairs 
of plots for the original variables in the Crabs dataset (to be discussed in Section [^1) and for 
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scaled SVD-transformed features in, respectively, the lower and upper panels, and with points 
marked according to the true classes. In the graphs for the variables in the original scale is 
hard to detect any clustering structure, whereas in the first three SVD features there appears 
to be a certain degree of separation among classes. Applying MBHAC on this features space 
should provide better starting points for applying the EM algorithm. 


- 0.4 - 0.1 0.1 0.3 - 0.3 - 0.1 0.1 - 0.2 0.0 0.2 - 0.10 0.00 0.10 



10 15 20 10 15 20 10 15 20 10 15 20 


Figure 1: Scatterplot matrix for the Crabs data: lower panels show scatterplots for pairs of 
variables in the original scale; upper panels show the features obtained by applying the scaled 
SVD transformation. For all graphs points are marked according to the true classes. 
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5 Data analyses 


In this Section we present some examples using real data. We compare the behaviour of 
the proposed transformations against the usual MBHAC for starting the EM algorithm, and 
two common strategies for GMMs initialisation. The first strategy is to start from the best 
partition obtained out of 50 runs of the /c-means algorithm. The second is the emEM strategy 


as implemented in the mixmod software (Biernacki et al, 2006; Auder et al, 2014), which, by 
default, uses 50 short runs of EM, each made of 5 iterations, followed by a long run of EM from 
the solution maximising the log-likelihood. The tolerance used for assessing the log-likelihood 
relative convergence of the EM algorithm is set to 10“®. 

The comparison among the different initialising strategies is based on two measures, the 
BIG and the adjusted Rand index (ARI). The former is used to select the best Gaussian 
finite mixture model with respect to both the number of components and the component- 
covariances decomposition. The BIG for model A4 with k components has the following 
form 

= 2£{^;x) - UM,klog{n), 

where x) is the maximised log-likelihood, iyM,k is the number of independent parameters 
to be estimated in model M, and k is the number of mixture components. This criterion 
depends on the starting partition through the log-likelihood at the MLEs penalised by 
the complexity of the model. It is the default criterion used in mclust for selecting a model, 
so the larger the value of the BIG the stronger the evidence for the corresponding model and 
number of components. 


The ARI (Hubert and Arable, 1985) is used for evaluating the clustering obtained with a 
given mixture model. This is a measure of agreement between two partitions, one estimated 
by a statistical procedure independent of the labelling of the groups, and one being the true 
classification. The ARI has zero expected value in the case of a random partition, and it is 
bounded above by 1, with higher values representing better partition accuracy. Furthermore, 
it can be applied to compare partitions having different numbers of parts. The ARI is the 


index recommended by Milligan and Gooper (1986) for measuring the agreement between 
an external reference partition and a clustering partition. In the following data analysis 
examples, we take advantage of the knowledge of the true classes for measuring clustering 
accuracy, but not for model fitting. 
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5.1 Crabs data 


We now consider a dataset consisting of data on five morphological measurements for 200 
Leptograpsus crabs, with 50 crabs for each of two colour forms (blue and orange) and both 
sexes. Figure shows the marginal distribution for each variable. In each stripchart data 
points are stacked to avoid overalapping, so the presence of several ties in the data is evident. 
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Figure 2: Stripcharts with stacked data points for the Crabs data showing the marginal 
distribution for each variable. From this plot the presence of several ties is clearly visible. 


Overall, there are 5! = 120 possible different ordering of the variables, of which 105 lead to 
selection of the (FEE,9) model, and 15 the (EEV,3) model. Table shows the corresponding 
BIC and ARI obtained for these models and for models initialised using the transformations 
discussed in Section As it can be seen, initialisation with the scaled SVD transformation 
returns the model with both the largest BIC and the most accurate partition (ARI = 0.7938). 
This transformation is the only one that selects the correct number of mixture components. 
Initialisation based on /c-means also yields the right number of components, but its fit and 
accuracy are substantially worse. The emEM initialisation strategy yields a better fit, but it 
selects the wrong number of clusters. 
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Table 1: Results from different initialisation strategies for the Crabs data. All possible 
orderings of the variables are reported for the default MBHAC initialisation; values among 
parentheses provide the number of such orderings which converge to the same solution. System 
times are averaged over 10 runs of the estimation algorithm. 


Initialisation 

M 

k 

BIC 

ARI 

System time 

MBHAC 
Default (105) 

EEE 

9 

-2883.68 

0.4831 

2.61 s 

(15) 

EEV 

3 

-2925.59 

0.4715 

2.47 s 

SPH 

EEE 

8 

-2861.27 

0.5248 

2.40 s 

PCS 

EEE 

9 

-2883.68 

0.4831 

2.69 s 

PCR 

EEE 

10 

-2894.65 

0.4572 

2.72 

SVD 

EEV 

4 

-2842.30 

0.7938 

1.88 s 

fe-means 

EEV 

4 

-2916.68 

0.5926 

2.78 s 

emEM 

EEE 

6 

-2866.61 

0.6305 

11.08 s 


Raftery and Dean (2006) selected as optimal subset for clustering purposes the variables 


(FL, RW, CW, BD) (in the identified ordering). Also for such a subset, different orderings give 
different results in 7 out of 24 possible arrangements. Table shows the clustering results 
obtained with different initialisation strategies using the above mentioned optimal subset of 
variables. Again, initialisation with the scaled SVD transformation yields the largest BIC 
and the highest ARI among the considered strategies, with a slight improvement over the 
default initialisation with the identified ordering of the variables. Note that emEM has results 
analogous to the best default MBHAC initialisation, but A:-means results are markedly worse. 

Finally, note that, using both the full set of the variables and the optimal subset, analysis 
with the initialisation using the scaled SVD transformation used the smallest computing 
time. This may appear counterintuitive at first sight, because computing the SVD is time 
consuming. However, this is more than balanced by the fact that having good starting values 
allows the EM algorithm to converge in fewer steps. 


5.2 Flea beetles data 

This dataset provides a sample of 72 flea beetles from three species. We consider the cluster¬ 
ing using the six physical measurements available, namely (tarsi, tars2, head, aedel, 
aede2, aedeS). 

There are 6! = 720 different orderings of the variables. A huge amount of variability is 
obtained by using the default initialisation: about half the times the (EEE,3) model is selected, 
but also models with 4, 5, and 6 components are chosen in a non negligible number of cases. 
Even when the same pair of covariance matrix decomposition and number of components 
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Table 2: Results from different initialisation strategies for the Crabs data using the optimal 
subset of variables identified by Raftery and Dean (2006). All possible orderings of the 
variables are reported for the default MBHAC initialisation; values among parentheses provide 
the number of such orderings which converge to the same solution. System times are averaged 
over 10 runs of the estimation algorithm. 


Initialisation 

M 

k 

BIC 

ARI 

System time 

MBHAC 






Default (17) 

EEV 

4 

-2609.89 

0.8154 

2.54 s 

(5) 

EEV 

5 

-2645.19 

0.7487 

2.40 s 

(1) 

EEV 

4 

-2609.90 

0.8276 

2.40 s 

(1) 

EEE 

11 

-2674.40 

0.4106 

2.46 s 

SPH 

EEV 

4 

-2609.79 

0.8154 

1.89 s 

PCS 

EEV 

4 

-2609.89 

0.8154 

2.55 s 

PCR 

EEE 

8 

-2652.58 

0.4860 

2.52 s 

SVD 

EEV 

4 

-2609.78 

0.8400 

2.06 s 

/c-means 

EEE 

5 

-2661.07 

0.5608 

2.62 s 

emEM 

EEV 

4 

-2609.87 

0.8154 

8.70 s 


is selected, the estimated models are different. In fact, there are 104 unique BIC values 
(rounding to second decimal point). A picture of these results is shown in the Figure appearing 
in Table with the area of dots proportional to the observed counts. Table contains the 
results for some orderings of the variables, and for the transformation-based initialisations 
discussed in Section From these we can see there are several reasonable starting values 
that lead to the best fitting model, which is also able to recover the true partition. Using 
any transformation-based initialisation, except that based on PCS, allows to recover the true 
partition without requiring more computing time, and removing the uncertainty connected 
with the ordering of variables. 


5.3 Female voles data 


Flury (1997) reported the data for a sample of 86 female voles from two species, 41 from 


Microtus californicus and 45 from Microtus ochrogaster, and seven variables describing various 
body measurements in units of 0.1mm, so several ties are contained in this data set. 

There are 7! = 5,040 possible orderings of the variables. When the default initialisation 
is used, there is considerable variation among the resulting solutions: 41 final models with 
different BIC values (up to 5 signihcant digits) are estimated, leading to 38 different partitions. 
About 67% of the solutions give a single component model, about 20% have three components, 
and only about 7.5% of the solutions correctly identify the correct number of clusters. On 
the contrary, three transformation-based initialisation strategies are able to achieve the best 
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Table 3: Results from different initialisation strategies for the Flea beetles data. Default 
initialisation reports the results obtained with some orderings of the variables; values among 
parentheses provide the number of such orderings which converge to the same pair (M, k). 
System times are averaged over 10 runs of the estimation algorithm. 


Initialisation 

M 

K 

BIC 

ARI System time 

MBHAC 
Default (367) 

FEE 

3 

-2785.57 

1.0000 

0.17 s 


(209) FEE 

4 

-2806.80 

0.9664 

0.21 s 


(123) FEE 

5 

-2810.68 

0.7886 

0.22 s 


(20) 

FEE 

6 

-2838.47 

0.5642 

0.19 s 

STD 


FEE 

3 

-2785.57 

1.0000 

0.13 s 

SPH 


EEE 

3 

-2785.57 

1.0000 

0.15 s 

PCS 


EEE 

4 

-2802.75 

0.7886 

0.17 s 

PCR 


EEE 

3 

-2785.57 

1.0000 

0.12 s 

SVD 


EEE 

3 

-2785.57 

1.0000 

0.17 s 

/c-means 

EEE 

3 

-2785.57 

1.0000 

0.17 s 

emEM 


EEE 

3 

-2785.57 

1.0000 

0.17 s 


fit in term of BIC, which in turn provides the largest ARI. The same optimal solution is also 
achieved by the emEM strategy, whereas results from the fc-means initialisation are inferior. 
Note that also in this case the scaled SVD transformation is among the best strategies, with 
no increase in computing time. 

Table 4: Results from different initialisation strategies for the Female voles data. The results 
for the default MBHAC initialisation are reported for some of all possible orderings of the 
variables; values among parentheses provide the number of such orderings which converge to 
the same pair (Al, k). System times are averaged over 10 runs of the estimation algorithm. 


Initialisation 

M 

K 

BIC 

ARI 

System time 

MBHAC 

Default (3395) 

XXX 

1 

-3874.9 

0.0000 

0.19 s 

(996) 

EEE 

3 

-3862.8 

0.7720 

0.29 s 

(381) 

EEE 

2 

-3844.2 

0.9081 

0.23 s 

(253) 

EEE 

4 

-3872.6 

0.6015 

0.24 s 

SPH 

EEE 

5 

-3871.0 

0.4129 

0.26 s 

PCS 

XXX 

1 

-3874.9 

0.0000 

0.22 s 

PCR 

EEE 

2 

-3844.2 

0.9081 

0.22 s 

SVD 

EEE 

2 

-3844.2 

0.9081 

0.19 s 

/c-means 

EEE 

4 

-3872.6 

0.6015 

0.25 s 

emEM 

EEE 

2 

-3844.2 

0.9081 

2.43 s 
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5.4 Italian wines data 


Forina et al 


reported data on several chemical and physical properties of 178 wines 
grown in the same region in Italy but derived from three different cultivars (Barolo, Grignolino, 
Barbera). We consider 27 of the original 28 variables that were described in the Forina et. al 


paper and are available in the pgmm R package (McNicholas et al, 2015). In what follows, the 
data are analysed on the standardised scale and, given the large number of features, only the 
common full covariance matrix model (FEE) is examined. Table shows the results obtained 
using different MBHAC initial partitions, /c-means and emEM initialisation strategies. Except 
for one case, the MBHAC initialisations allow one to achieve good clustering accuracy. In 
particular, SVD has the highest BIG and attains a clustering that perfectly matches the real 
classes. By comparison, the models initialised by /c-means and emEM show somewhat worse 
values of both BIG and ARE 


Table 5: Results from different initialisation strategies for the Italian wines data. System 
times are averaged over 10 runs of the estimation algorithm. 


Initialisation 

M 

K 

BIG 

ARI 

System time 

MBHAC 

Default 

EEE 

3 

-12309.90 

0.9489 

0.36 s 

SPH 

EEE 

2 

-12370.80 

0.5820 

0.78 s 

PCS 

EEE 

3 

-12309.90 

0.9489 

0.36 s 

PGR 

EEE 

3 

-12309.90 

0.9489 

0.37 s 

SVD 

EEE 

3 

-12306.75 

1.0000 

0.35 s 

/c-means 

EEE 

3 

-12314.72 

0.9637 

0.74 s 

emEM 

EEE 

4 

-12312.04 

0.8355 

3.99 s 


6 Final comments 

The mixture model-based approach to clustering provides a firm statistical framework for 
unsupervised learning. The EM algorithm is typically used for parameter estimation. How¬ 
ever, because of the many local maxima of the likelihood function, an important problem 
for getting sensible maximum likelihood estimates is to obtain reasonable starting points. A 
computationally efficient approach to EM initialisation is based on the partitions obtained 
from agglomerative hierarchical clustering. 

In this paper we have presented and experimented with simple transformation-based meth¬ 
ods to refine the EM initialisation step derived from model-based agglomerative hierarchical 
clustering. The proposed transformation-based strategies allow one to remove the depen- 
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dence on the ordering of the variables when selecting starting partitions. They also often 
lead to improved model fitting and more accurate clustering results. Among the investigated 
transformations, the scaled SVD transformation performed the best in our experiments. The 
proposed approach may be applicable to other mixture modelling contexts, but we have not 
explored this possibility yet. Future studies will be devoted to this aspect. 


As mentioned by Biernacki et al (2003, p. 567) and Melnykov and Maitra (2010), we 
cannot expect an initialisation strategy to work uniformly well in all cases. Therefore, it 
is important to explore different strategies and to choose the solution with the highest log- 
likelihood value, but also to consider those sub-optimal solutions on a more subject specific 
considerations. 

The transformation-based initialisation strategies discussed in this paper are available in 
the R package mclust (version >= 4.4), which can be downloaded from CRAN at http: 
//cran.r-proj ect.org/web/packages/mclust/index.html, 
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